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Abstract 


There  are  several  different  types  of  parallel  computer  architectures  in  use  today.  Some  of 
these  are  large  machines  that  house  hundreds  of  processors  with  low-  to  mid-range  computing 
capabilities.  A  different  type  of  parallel  computer  architecture  becoming  increasingly  popular 
is  that  of  the  cluster.  Clusters  are  basically  networked  workstations:  each  containing  1  to  <~30 
processors  with  mid-  to  high-range  computing  capabilities.  While  arguments  can  be  made  for 
both  paradigms,  clusters  seem  to  be  gaining  in  popularity.  They  provide  fast  computation 
through  multiplicity  and  fast  processor  throughput.  Furthermore,  code  reuse  on  different  cluster 
environments  is  now  possible  with  die  adoption  of  standard  inteiprocessor  communication  tools 
like  the  message-passing  interface  (MPI)  and  high-performance  FORTRAN  (HPF).  This  report 
compares  and  contrasts  the  performance  of  MPI  and  HPF  on  a  currently  available  computer 
cluster. 
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Conventions. 

Typewriter  font  is  used  for: 

•  any  output  directly  produced  by  the  computer  (this  includes  program  results 
and  operating  system  messages),  and 

•  program  source  code  listings. 
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1  Introduction. 


The  message-passing  interface  (MPI)  and  high-performance  FORTRAN  (HPF)  are 
two  sets  of  extensions  to  programming  languages  that  seek  to  provide  architecture- 
independent  parallelism.  MPI  extensions  exist  for  both  the  C  and  FORTRAN  pro¬ 
gramming  languages.  HPF  is  specifically  tailored  to  augment  the  FORTRAN  90 
language.  Both  MPI  and  HPF  have  adopted  published  standards.  HPF  was  adopted 
in  1993  [1],  and  MPI  was  standardized  in  1994  [2].  These  standards  have  allowed 
codes  to  be  easily  ported  between  different  computer  architectures  with  a  minimum 
of  code  rewriting. 

These  two  systems  oj0fer  quite  different  parallel  computational  models.  A  compu¬ 
tation  in  MPI  usually  consists  of  one  or  more  processes  that  communicate  messages 
(sends  and  receives)  through  calls  to  library  routines.  These  libraries  are  ideally  cus¬ 
tomized  and  optimized  for  the  specific  hardware  system  on  which  the  computations 
are  being  performed.  As  long  as  the  programs  contain  standard  MPI  directives,  they 
should  be  able  to  be  ported,  recompiled,  and  executed  on  different  architectures  using 
the  MPI  system.  MPI  is  sometimes  referred  to  as  the  “assembly  language”  of  parallel 
programming,  since  the  programmer  is  required  to  specify  the  parallelism  explicitly 
through  calls  to  the  message-passing  library.  This  is  often  classified  as  an  advantage 
rather  than  a  hindrance.  Effective  cache  use  and  memory  management  are  becoming 
extremely  important  to  achieving  good  parallel  performance.  Message  passing  can 
help  in  this  regard  by  providing  more  programmer  control  of  data  locality  in  the 
memory  hierarchy  [3].  MPI  appears  to  be  surpassing  the  Parallel  Virtual  Machine 
(PVM)  interface  as  the  defacto  standard  in  message  passing  parallelism. 

In  contrast,  HPF  is  more  closely  associated  with  the  data-parallel  programming 
model.  Data  parallelism  attempts  to  exploit  the  concurrency  of  the  same  operation 
to  multiple  data  elements  [4].  For  example,  one  may  wish  to  add  the  value  10  to  each 
element  of  an  array.  This  operation  is  inherently  parallel,  since  there  are  no  data 
dependencies  inside  this  atomic  operation.  FORTRAN  90  supports  such  notations 
as  A  =  A  -I-  10  to  perform  this  operation,  where  A  can  be  a  multidimensional  array. 
HPF  augments  FORTRAN  90  with  directives  that  inform  the  compiler  that  there  are 
no  data  dependencies  and  the  code  region  is  data-parallel  safe.  It  is  up  to  the  HPF 
compiler  to  optimally  insert  communication  directives  between  processors  and  ensure 
synchronization  between  parallel  regions.  HPF  is  therefore  slightly  more  abstract 
than  MPI.  It  expresses  parallelism  at  a  relatively  high  level  and  is  intended  to  remove 
the  programmer  from  the  more  mundane  tasks  of  specifying  communication  behavior 
between  processors  [5]. 

Of  particular  interest  is  how  well  these  systems  perform,  both  in  general  and  com¬ 
pared  to  each  other,  in  current  computer  architectures.  The  two  basic  performance 
metrics  of  parallel  computing  systems,  speedup  and  efficiency,  were  studied  in  this 
experiment.  Scalability  is  an  interesting  topic  in  this  domain,  but,  as  it  relates  to 
this  problem,  only  provides  useful  data  for  algorithms  with  similar  floating-point  and 
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interprocessor  communication  requirements.  The  actual  speedup  achieved  is  defined 
as 

=  (1) 

J-p 

in  which  Ti  is  the  linear  (one  processor)  completion  time  for  the  algorithm,  and  Tp  is 
the  parallel  completion  time  using  p  processors.  The  second  metric  used  is  efficiency. 
Efficiency  is  defined  as 


Ep  = 


(2) 


This  number  indicates  the  overall  efficiency  of  the  p  processors  working  on  the  prob¬ 
lem.  Ideally,  this  number  should  be  as  close  to  100%  as  possible,  but  will  suffer 
because  of  conditions  of  load  imbalancing,  communication  costs,  and  various  other 
parallelization  overheads. 

The  cluster  architecture  used  in  this  experiment  is  quite  relevant  in  today’s  com¬ 
puting  environments.  These  coarse-grained  architectures  and  clusters  of  workstations 
are  becoming  extremely  popular  for  parallel  computations.  They  can  appear  in  many 
manifestations.  Indeed,  a  cluster  can  be  as  little  as  two  machines  networked  together. 
This  alone,  theoretically,  boosts  performance  by  a  factor  of  2.  At  the  other  end  of 
the  spectrum  are  very  powerful  hosts  connected  together  over  fast  data  channels, 
such  as  the  Silicon  Graphics  Power  Challenge  Array  (96  nodes)  at  the  U.S.  Army 
Research  Laboratory.  Three  factors  have  played  into  clusters  becoming  viable  par¬ 
allel  programming  platforms.  These  factors  are  workstation-level  high-performance 
microprocessors,  standardized  high-speed  communication,  and  reliable  standardized 
tools  for  distributed  computing  [6].  Today’s  workstations  have  processors  that  are 
quite  robust.  Very  few  of  them  need  to  be  coupled  to  deliver  impressive  parallel 
performance.  High-performance  networks,  such  as  ATM,  HiPPI,  and  FDDI  are  now 
capable  of  delivering  bandwidths  of  around  100  MB/s.  Standard  tools  for  synchro¬ 
nization  and  communication,  such  as  MPI  and  HPF,  continue  to  grow  and  mature 
into  trusted  systems. 


2  Experimental  Hypotheses. 

Computer  algorithms  usually  work  on  sets  of  data  structures.  Attacking  these  struc¬ 
tures  with  multiple  processors  is  what  provides  parallelism.  Theoretically,  if  the  data 
structures  used  are  of  size  n,  then  n  processors  could  be  used  to  perform  atomic  oper¬ 
ations  in  parallel.  Breaking  the  problem  down  to  find  this  lowest  level  of  parallelism 
is  known  as  decomposition.  However,  seldom  do  such  large  numbers  of  processors 
exist.  For  example,  say  we  have  an  integer  array  B  that  is  two-dimensional  (2-D)  and 
of  size  1,000  X  1,000.  While  an  operation  B  =  B  -I-  1  is  possible  in  parallel,  rarely 
do  machines  have  one  million  processors.  Therefore,  the  data  must  be  agglomerated 
and  distributed  among  the  number  of  processors  available.  Several  strategies  and 
mechanisms  are  available  for  this  task.  HPF  provides  some  standard  decompositions. 
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while  MPI  requires  the  programmer  to  instrument  the  code  to  achieve  the  desired 
result.  While  a  discussion  of  the  numerous  decompositions  is  not  the  purpose  of 
this  paper,*  a  quick  example  is  helpful.  Two  agglomerations  are  quite  popular  in 
coarse-grained  architectures  where  a  minimum  of  communications  is  desired.  These 
are  one-dimensional  (1-D)  row  and  1-D  column  decompositions  and  are  are  pictured 
in  Figure  1.  This  figure  shows  the  case  of  a  2-D  array  on  a  four-processor  computer. 


1  -D  Row  Decomposition  1  -D  Column  Decomposition 

Figure  1:  A  2-D  Array  Decomposed  by  Rows  and  Columns. 

The  2-D  array  is  blocked  by  rows  in  the  row  decomposition  and  by  columns  in  the 
column  decomposition.  Processor  1  (PI)  gets  the  first  block  of  data,  processor  2 
(P2)  the  next,  and  so  on.  Data  distribution  is  usually  done  to  try  and  limit  costly 
interprocessor  communication.  Other  decompositions  are  available,  but  these  two  are 
quite  prevalent  in  coarse-grained  architectures.  All  of  this  depends,  however,  on  the 
algorithm’s  action  inside  the  data  matrix. 

Several  decompositions  and  agglomerations  were  studied  in  this  experiment.  MPI 
requires  data  to  be  distributed  to  the  individual  processors  explicitly  by  the  program¬ 
mer.  HPF  contains  directives  that  specify  how  the  data  should  be  distributed  by  the 
compiler.  These  computations  and  tests  were  performed  on  a  Digital  Equipment 
Corporation  (DEC)  Alpha  network  with  individual  machines  being  connected  by  an 
FDDI  network.  The  following  hypotheses  were  investigated. 

(1)  One-dimensional  row  decomposition  using  MPI  and  1-D  column  decomposition 
using  HPF  will  result  in  the  fastest  speedups  for  this  algorithm  in  the  two  dif¬ 
ferent  parallel  programming  environments.  Row  decomposition  in  C  has  shown 
to  be  slightly  faster  in  studies  of  finite  difference  algorithms.  This  seems  to  be 
because  the  memory  is  laid  out  in  row  major  order.  In  FORTRAN,  one  can 
expect  column  major  order  accessed  by  columns  to  be  faster. 

(2)  There  should  be  little  difference  between  the  fastest  decomposition  and  agglom¬ 
eration  with  MPI  and  the  fastest  similar  decomposition  with  HPF.  The  main 

‘Complete  examples  can  be  found  in  [4]  and  [5]. 
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limiting  factor  of  both  should  be  the  speed  of  the  interconnection  network.  Ide¬ 
ally,  both  should  make  the  same  optimal  use  of  this  system.  FORTRAN  code 
lias  shown  itself  to  be  faster  in  some  algorithms,  but  the  difference  in  the  par¬ 
allel  codes  using  MPI  and  HPF  should  be  no  more  than  that  noticed  in  the 
sequential  algorithms. 

(3)  Based  on  studies  of  finite  difference  codes  and  the  author’s  prior  experience  with 
other  distributed  memory,  message-passing  environments,  an  overall  efficiency 
of  about  85%  is  all  that  can  be  expected  from  the  MPI  and  HPF  codes. 

3  Case  Study:  Image  Matching. 

3.1  The  Image-Matching  Algorithm. 

The  eye-brain  system  achieves  three-dimensional  (3-D)  depth  perception  by  taking 
advantage  of  two  separate  and  distinct  images  captured  by  each  eye.  The  image 
captured  by  the  left  eye  is  slightly  different  than  the  one  captured  by  the  right  eye. 
This  difference  is  called  retinal  disparity,  and  the  brain  is  able  to  quickly  use  this 
information  and  other  binocular  cues  to  compute  depth  of  objects  in  what  the  eyes 
are  seeing. 

Computers  and  machines  can  also  determine  depth  of  objects  in  their  environ¬ 
ments,  but  not  quite  as  easily.  One  method  involves  using  active  sensors,  or  lasers,  to 
determine  range  information.  This  method  is  limited  in  that  lasers  can  usually  only  be 
directed  at  certain  distinct  points,  thus  limiting  the  machine’s  ability  at  determining 
range  information  in  the  entire  scene. 

A  more  interesting  approach  is  to  have  the  computer  mimic  the  brain’s  behavior. 
This  approach  is  known  as  computational  stereo  vision.  A  stereo  camera  pair  is  used 
to  take  pictures  of  a  scene  where  range  information  is  required.  Because  of  binocular 
parallax,  these  cameras  will  acquire  slightly  different  images  of  the  scene,  since  they 
are  at  diflferent  locations.  Points  very  distant  from  the  cameras  will  appear  to  be 
at  almost  the  same  vertical  and  horizontal  positions  on  digitized  images  from  the 
cameras.  Objects  closer  to  the  cameras  will  be  more  displaced.  This  phenomenon 
can  be  seen  in  Figure  2.  Notice  that  points  distant  in  the  picture  are  at  roughly  the 
same  location  in  both  left  and  right  images.  However,  points  close  to  the  camera, 
such  as  the  measured  white  spot  on  the  highway,  have  a  much  greater  disparity  in 
the  two  images. 

By  having  the  computer  determine  which  points  match  in  the  two  images,  and  then 
computing  the  amount  the  points  shift,  the  computer  is  mimicing  the  retinal  disparity 
computations  performed  by  the  brain.  The  mathematics  involve  only  trigonometry; 
however,  the  number  of  transformations  for  high-resolution  images  becomes  stagger¬ 
ing.  Every  point  is  matched,  thus  giving  a  fine  grid  of  range  information  that  dwarfs 
the  capabilities  of  active  sensors.  The  main  drawback  of  this  approach,  however,  is 
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Figure  2:  Depth  Determined  by  Disparity  in  Images  (ruler  bars  are  not  to  scale). 

the  time  it  takes  to  match  the  points  in  the  two  images. 

The  algorithm  to  perform  computational  stereo  is  stochastic;  therefore,  it  does 
require  some  time  to  complete.  It  is  an  undirected  Monte  Carlo  search  through  the 
image  space  that  produces  a  very  fine,  globally  optimized  disparity  map  where  every 
pixel  in  one  image  is  matched  with  its  corresponding  pixel  in  the  stereo  pair.  As 
with  other  Monte  Carlo  algorithms,  this  approach  requires  a  significant  number  of 
fioating-point  operations.  However,  the  process  of  matching  pixels  typically  requires 
only  local  interactions.  On  the  computer,  this  translates  into  local  references  to 
memory  by  applying  a  nine-point  stencil  to  the  2-D  digitized  images.  Furthermore, 
besides  certain  boundary  conditions,  the  amount  of  processing  for  each  pixel  remains 
uniform.  These  properties  make  the  algorithm  ideal  for  parallel  processing. 

Stereo  matching  requires  global  optimization.  Since  the  digital  image  data  maps 
pixel  intensities  to  a  relatively  low  resolution  (typically  8  bits,  implying  256  discrete 
levels),  there  are  many  possible  matches  in  the  local  sense.  Swatches  in  one  image 
may  appear  to  map  other  portions  of  the  stereo  image  pair.  To  perform  this  operation 
accurately,  the  entire  image  must  be  taken  into  account. 

One  of  the  most  popular  optimization  techniques  to  locate  a  global  optimum  is 
called  simulated  annealing.  This  approach  can  be  applied  directly  to  the  image¬ 
matching  problem  [7].  As  the  name  implies,  the  approach  imitates  a  natural  process. 
Annealing  involves  heating  a  solid  to  the  extent  that  the  molecules  may  randomly 
rearrange  themselves  and  then  cool  gradually.  Slowly  lowering  the  temperature  allows 
the  molecules  to  settle  into  the  lowest  energy  state,  commonly  described  as  thermal 
equilibrium.  If  the  temperature  rate  declines  too  fast,  defects  may  become  frozen  into 
the  end  state.  If  thermal  equilibrium  is  maintained  throughout  the  cooling  cycle,  the 
final  system  should  be  a  globally  optimized  structure.  For  example,  perfect  crystals 
are  grown  in  this  manner. 

The  simulated  annealing  technique  is  outlined  in  Figure  3.  The  system  is  taken  to 
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read  Ir,  II 

D{row,  col)  =  random  number  in  [0 . . .  Dm  ax] 

T  =  Tmax 

/*  loop  according  to  fixed  annealing  schedule  */ 
while  T  >  Tmjn 

5'  <=  random  state  change  S 
AE  =  E(5')  ~  EiS) 

/*  accept  lower  energy  states  */ 

if  AE  <  0  then  5  =  5' 

else 

P  - 

j  =  random  number  in  [0 ...  1] 

/*  accept  higher  energy  only  with  Boltzman  probability  */ 

if  j  <  P  then  5  =  5' 
reduce  T  by  predefined  percentage 
end  while 


Figure  3:  Simulated  Annealing  Algorithm  in  Pseudocode. 

equilibrium  by  the  Metropolis  algorithm  by  considering  random,  local  state  transitions 
on  the  basis  of  the  change  in  energy  that  they  imply.  Since  the  system  is  stochastic, 
these  local  state  changes  can  take  the  system  away  from  convergence  as  well  as  toward 
it.  This  helps  to  prevent  the  system  from  sinking  into  local  minima.  The  processing 
is  complete  when  the  system  is  in  equilibrium  at  the  lowest  energy  state  achievable. 
A  more  detailed  discussion  of  this  technique  and  its  Army  applications  may  be  found 
in  [8]. 

The  final  result  of  the  algorithm  is  a  2-D  disparity  map.  The  values  in  the  dis¬ 
parity  map  are  integer  values  ranging  from  0  (no  disparity)  to  D-MAX  (maximum 
disparity).  To  better  interpret  the  map,  these  values  are  coded  with  gray  scale  values 
and  written  to  binary  data  files.  Examples  of  these  encoded  disparity  maps  appear 
later  in  this  paper. 

3.2  Testing  MPI  and  HPF  for  Correctness. 

The  stereo-matching  algorithm  was  tested  with  several  computer-generated  random 
dot  stereograms.  These  stereograms  represent  synthetic  3-D  objects.  In  this  case 
we  simulate  a  camera  system  looking  down  on  a  “wedding  cake”  structure.  Figure  4 
shows  the  3-D  representation  of  this  four-tiered  structure.  The  stereogram  is  created 
by  starting  with  a  solid  black  background.  It  is  then  speckled  with  randomly  placed 
white  pixels.  The  number  of  white  pixels  is  limited  to  10%  of  the  total  image  to  test 
the  robustness  of  the  algorithm.  The  hypothetical  right  camera  of  the  two-camera 
system  is  assigned  this  random  dot  image.  Since  the  object  is  tiered,  a  stereo  camera 
system  above  and  facing  the  object  would  consist  of  two  cameras  (a  right  and  left 
camera)  that  would  perceive  the  dots  to  be  at  different  locations  in  the  two  cameras. 
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Figure  4;  3-D  Wedding  Cake  Structure. 


This  effect  is  simulated  by  creating  the  left  image  of  the  stereo  pair  by  shifting  pixels 
in  the  right  image  to  the  right.  Pixels  around  the  outer  edge  were  not  offset,  the  next 
level  in  was  offset  by  two  pixels,  the  next  level  four  pixels,  and  the  center  was  offset 
by  six  pixels.  Pixels  with  high  movement  represent  areas  that  would  be  close  to  a 
camera  looking  down  from  above  the  wedding  cake  whereas  pixels  with  no  disparity 
would  be  distant  from  the  camera.  Figure  5  shows  the  random  dot  stereo  pair. 
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Figure  5:  Random  Dot  Stereogram.  The  Left  Image  Is  Formed  by  Displacing  Points 
in  the  Right  Image. 

Because  of  the  well-defined  disparity  maps,  these  random  dot  images  represent 
ideal  cases  for  evaluating  stereo-matching  algorithms.  That  is,  we  know  how  the 
result  should  look,  whereas  in  a  real-world  image,  there  would  be  some  doubt  as  to 
what  an  exact  map  should  look  like.  There  are  still  some  areas  of  ambiguity,  usually 
a  result  of  sections  that  are  devoid  of  white  pixels;  however,  the  overall  structure  of 
the  map  remains  clear.  The  algorithm  was  checked  for  correctness  and  validated  in 
all  test  cases.  As  an  example.  Figure  6  shows  a  gray  scale  encoded  disparity  map 
representing  a  solution  to  the  matching  problem  from  sequential  C  code  and  parallel 


code  using  the  MPI  system.  The  actual  2-D  result  disparity  map  contains  values 
in  the  range  0, . . . ,  6.  To  better  interprete  the  results,  the  final  disparity  maps  were 
gray-scale  encoded.  Lighter  areas  in  the  image  represent  areas  close  to  the  camera; 
darker  areas  are  further  away. 

4  Implementation  Details. 

4.1  MPI. 

Implementation  of  the  MPI  version  of  the  code  was  straightforward.  The  mpicc 
compiler  on  the  DEC  Alpha  was  used  with  linkage  to  the  MPI  libraries  for  access  to 
the  communication  directives.  The  default  optimization  (-0)  was  used  to  correspond 
with  the  optimization  used  for  the  sequential  code  version.  This  ensures  similar  code 
generation  and  allows  for  meaningful  comparisons  between  the  sequential  and  parallel 
code  versions. 

One  dimensional  decompositions  by  rows  and  columns  were  used.  The  master 
process  is  in  charge  of  opening  and  reading  the  left  and  right  image  data  files.  Since  the 
disparity  map  is  randomized  at  the  start  of  the  simulation,  each  processor  initializes  its 
own  section  of  the  disparity  map.  Image  data  never  change  during  the  program,  so  this 
static  data  is  distributed  once  to  each  processor  at  the  start  of  the  program.  Boundary 
conditions  exist  since  a  nine-point  stencil  is  being  passed  over  the  2-D  arrays.  For 
example,  with  the  row  decomposition  shown  in  Figure  1,  processor  1  requires  the 
image  data  from  the  first  row  governed  by  processor  2.  Also,  processor  2  requires  the 
last  row  of  the  image  data  owned  by  processor  1.  Therefore,  a  processor  also  gets 
some  of  the  image  data  that  belongs  to  its  neighboring  processor  at  the  beginning. 
However,  since  disparity  data  is  dynamic,  it  must  be  distributed  between  processors  at 
each  loop  iteration.  A  nine-point  stencil  is  used  in  this  algorithm  because  each  point 
needs  to  know  the  disparity  value  of  its  eight  neighboring  points.  This  causes  the 
creation  of  boundary  conditions,  or  shadow  points,  along  the  processor  boundaries. 
These  shadow  points  define  the  interprocessor  communication  requirements  for  the 
computation. 

Boundary  conditions  for  the  column  decomposition  are  slightly  more  involved  in 
terms  of  image  data.  Since  the  algorithm  assumes  a  perfect  image  in  the  vertical  ori¬ 
entation  (no  disparity  up  or  down) ,  we  are  only  concerned  with  finding  the  horizontal 
disparity.  The  photometric  component  in  the  energy  function  uses  the  difference  be¬ 
tween  the  intensities  of  the  proposed  matched  points  in  the  left  and  right  images. 
The  point  in  the  left  image,  however,  can  match  a  point  in  the  right  image  up  to 
D-MAX  pixels  to  the  right.  For  example,  assume  an  image  of  size  256  x  256  with 
four  processors  working  on  the  problem.  Processor  1,  with  a  column  decomposition, 
will  have  image  data  from  rows  0, . . . ,  255  and  columns  64, ... ,  127  of  the  left  and 
right  images.  It  could  be  possible  that  the  point  in  the  left  image  (row  =  100,  column 
=  127)  matches  with  the  point  in  the  right  image  (100,  132).  In  this  case,  processor  1 
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does  not  have  all  of  the  data  about  the  right  image  that  it  requires.  To  compensate  for 
this  in  the  column  decompositions,  each  processor  receives  DJAAX  extra  columns 
of  the  right  image  from  the  master  at  the  start  of  the  algorithm. 

The  code  was  tested  on  random  dot  stereograms  of  sizes  200  x  200,  400  x  400, 
600  X  600,  and  800  x  800  with  two,  four,  six,  and  eight  processors.  All  combinations 
which  were  tested  completed  successfully.  Figure  6  shows  the  200  x  200  gray  scale 
result  maps  for  the  sequential  case,  the  MPI  row  decomposition,  and  the  MPI  column 
decomposition.  The  MPI  results  were  generated  in  each  case  using  four  processors. 


Figure  6:  From  Left  to  Right,  Results  From  the  Sequential  C  Program,  MPI  Row 
Decomposition,  and  MPI  Column  Decomposition.  The  Results  Are  Slightly  Different 
for  Each  Case  Given  the  Random  Nature  of  the  Simulations. 


4.2  HPF. 

Many  difficulties  were  experienced  while  trying  to  move  this  code  into  FORTRAN 
90  and  into  HPF.  The  HPF  compiler  on  the  DEC  Alpha  computers  has  not  yet 
evolved  to  the  stage  of  being  a  fully-functional  HPF  compiler.  To  begin  with,  sev¬ 
eral  important  FORTRAN  90  constructs,  as  well  as  critical  HPF  constructs  are 
not  supported.  The  compiler  currently  does  not  generate  code  tailored  to  the 
!HPF$  INDEPENDENT  directive.  Here  is  the  compiler  error  message: 

f90:  Warning:  rows. f 90,  line  148:  The  INDEPENDENT  directive  is  checked  for 
syntactic  and  semantic  correctness,  but  it  is  then  ignored  by  the  current 
HPF  compiler. 

!HPF$  INDEPENDENT 

This  directive  should  be  used  around  areas  that  are  data-parallel  safe  (contains  no 
dependencies)  to  assist  the  compiler  in  parallelizing  the  section.  These  directives  were 
removed  from  the  code  to  reduce  the  number  of  warning  messages  produced  by  the 
compiler.  With  a  compiler  that  does  support  them,  these  directives  should  precede  all 
of  the  FORALL  statements  in  the  simulated  annealing  code.  Furthermore,  WHERE 
statements  may  not  be  located  inside  of  FORALL  loops.  This  lack  of  functionality 
requires  two  FORALL  constructs  where  often  only  one  with  a  WHERE  statement 
should  be  needed. 
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Data  distribution  directives  in  HPF  are  used  to  specify  array  data  distribution  to 
the  processors  available  in  the  processor  pool.  As  in  MPI,  two  distribu¬ 
tions  were  used,  1-D  row  and  1-D  column.  The  first  distribution  used  was 
!HPF$  DISTRIBUTE(BLOCK,  *)  ::  dMap.  This  distributes  the  “dMap”  array  data 
in  a  row-blocked  format,  similar  to  the  one  used  in  MPI.  The  other  distribution  tested 
was  (*,  BLOCK)  which  corresponds  to  column  decomposition.  The  rest  of  the  book¬ 
keeping  and  computational  storage  arrays  (there  were  several)  were  aligned  with  the 
dMap  array.  There  were  several  problems  in  getting  HPF  to  work  with  this  code. 
Several  times,  the  executables  would  get  hung  up  and  then  exit  with: 

TCP.MsgReadMsg:  read  error  54ows:  ERROR  PeerCO]  (38)  _TCP_Send:  send  length 
error  -  errno  9  (msgsend.c  Line: 377) 

ows:  ERROR  Peer [2]  (38)  _TCP_Send:  send  length  error  -  ermo  9  (msgsend.c 
Line: 377) 

ows:  ERROR  Peer [3]  (38)  _T(a>_Send:  send  length  error  -  ermo  9  (msgsend.c 
Line: 377). 

The  (BLOCK,  *)  distribution  produced  the  following  error  during  runtime, 
forrtl:  error  (72):  floating  overflow 

TCP.MsgReadMsg:  read  error  54ows:  ERROR  Peer[0]  (29)  _TCP_RecvAvail :  Unexpected 
EOF  from  peer  0  (msgrecva . c  Line : 189)  . 

There  appear  to  be  no  situations  in  the  code  that  could  cause  such  an  error.  Indeed, 
the  type  for  the  real  array  used  in  comparison  to  the  exponential  function  was  changed 
to  double  precision  with  the  same  results.  Furthermore,  the  code  performs  fine  with 
identical  typing  in  the  sequential  version  and  in  the  HPF  version  in  -single  mode. 
Compiling  the  (*,  BLOCK)  distribution  produced  the  following  warnings: 

f90:  Warning:  compute_loop_carried_sets :  do.tree  construct  dt.group^forall 
not  handled 

f90:  Warning:  compute^loop.carried^sets :  do^tree  construct  dt_group„forall 
not  handled 

f90:  Warning:  make^edges:  do.tree  construct  dt_group_forall  not  handled 
f90:  Warning:  make_edges:  do^tree  construct  dt^group^forall  not  handled. 

The  initial  belief  was  that  having  multiple  statements  inside  a  FORALL  loop  was  the 
cause  for  these  cryptic  messages.  However,  the  code  does  work  properly  when  imple¬ 
mented  on  one  processor.  Also,  HPF  documentation  states  that  multiple  statements 
are  allowed  in  FORALL  loops  and  that  they  are  performed  in  order.  This  should 
preclude  any  data  dependencies. 

Currently,  there  does  not  appear  to  be  a  good  explanation  for  most  of  these  warn¬ 
ings  nor  a  good  understanding  of  their  intended  meanings.  Other  distributions  were 
attempted  (e.g.,  CYCLIC),  and  each  experienced  some  problem  like  those  previously 
listed. 
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5  Experimental  Results  and  Interpretation. 

5.1  MPL 

Originally,  the  sequential  code  for  the  algorithm  was  written  in  C  and  compiled  using 
the  mpicc  compiler.  However,  this  produced  executables  that  were  slower  than  the 
MPI  version  with  one  processor.  This  resulted  in  dubious  occurrences  of  superlinear 
speedup.  A  possible  explanation  resides  in  the  compile  sequence  generated  by  the 
makefile  that  was  used.  The  makefile  performed  the  following  operations: 

mpicc  -DFORTRANDNDERSCORE  -DMPE_USE_EXTENSI0NS=1  -DHAS_XDR=1  -DSTDC_HEADERS=1 
-DHAVE_STDLIB_H=1  -DMALL0C_RET_V0ID=1  -DHAVE_SYSTEM=1  -DHAVE_NICE=1 
-DP0INTER_64_BITS=1  -DINT_LT_P0INTER=1  -DHAVE_LONG_DOUBLE=1 
-DHAVE_L0NG_L0NG_INT=1  -0  -DHPI.alpha  -c  mpirows.c 

mpicc  -0  -o  mpirows  mpirows.o  -L/usr/local/mpi-vl.0.12/lib/alpha/ch_p4  -Impi 

-Im. 

Undoubtedly,  one  of  these  options  caused  the  production  of  more  efficient  code. 
Therefore,  to  make  the  process  more  homogeneous,  the  sequential  version  used  in 
these  comparisons  is  actually  augmented  with  the  MPI  initializations  and  ran  through 
MPI  with  one  processor.  Furthermore,  to  ensure  a  true  refiection  of  speedup  and  ef¬ 
ficiency,  Porsche,  the  faster  processor  of  the  group  of  DEC  Alphas,  was  not  used  for 
these  sequential  timings.  The  sequential  execution  times  for  the  different  problem 
sizes  are  given  in  Table  A.l.  The  code  for  this  version  of  the  algorithm  is  given  in 
Appendix  B.  This  is  not  a  production  code;  therefore,  comments  and  documentation 
within  the  code  listings  are  sparse.  The  same  caveat  goes  for  all  code  listed  in  the 
appendices.  The  code  for  the  row  decomposition  in  MPI  is  given  in  Appendix  C  and 
the  code  for  the  column  decomposition  is  given  in  Appendix  D.  The  execution  times 
for  the  row  decompositions  and  column  decompositions  are  given  in  Tables  A.2  and 
A.3,  respectively. 

The  speedup  achieved  by  the  MPI  row  decomposition  is  shown  in  Figure  7,  and 
the  speedup  achieved  by  the  MPI  column  decomposition  is  shown  in  Figure  8.  To  dis¬ 
tribute  the  noncontiguous  C  data  that  results  from  column  decompositions,  a  derived 
data  type  had  to  be  created  in  MPI.  MPI  sends  data  based  on  addresses,  which  are  row 
major  in  C.  Trying  to  send  data  in  column  major  order  is  undoubtedly  simply  adding 
more  buffering  behind  the  scenes.  The  speedups  achieved  by  row  decomposition  are 
faster  than  the  speedups  for  the  column  decomposition  in  every  case. 

The  efficiencies  corresponding  to  row  and  column  decompositions  are  shown  in 
Figures  9  and  10,  respectively.  The  efficiencies  are  very  good.  The  average  row 
decomposition  efficiency  is  around  94%.  As  more  processors  are  invoked,  one  can  see 
the  efficiencies  start  to  decline  slightly.  This  is  a  standard  phenomena.  The  code  will 
only  run  as  fast  as  the  slowest  processor  in  the  pool.  When  using  more  processors, 
the  individual  load  averages  of  the  processors  has  a  direct  effect  on  efficiency.  Still, 
when  94  out  of  every  100  instructions  is  working  on  the  problem,  the  system  has  been 
well  tuned. 
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MPI  Speedup  1-D  Row  Decomposition 
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Figure  7:  MPI  1-D  Row  Decomposition  Speedup. 

5.2  HPF. 

The  sequential  timings  were  generated  by  compiling  the  HPF  code  using  the  standard 
£90  compiler.  The  HPF  directives  are  ignored  by  the  compiler,  and  sequential,  one 
processor  code  is  generated.  There  was  a  problem  when  trying  to  perform  tests  on 
the  800  X  800  image  size.  Dynamic  array  allocation  was  used  for  each  array  other 
than  “dMap”  since  this  array  size  must  be  known  at  compile  time  so  the  other  arrays 
may  be  aligned  to  it.  The  reason  for  the  runtime  problems  could  not  be  determined, 
and  this  test  case  was  omitted  from  the  trials. 

The  code  to  perform  the  sequential,  as  well  as  the  row  and  column  decompositions 
in  HPF,  is  given  in  Appendix  E.  The  code  listed  performs  row  decomposition  by 
distributing  data  in  row  block  format.  The  column  distribution  was  done  by  altering 
the  HPF  directive  (BLOCK,  *)  to  (*,  BLOCK).  Sequential  code  was  generated  by 
invoking  the  f90  compiler  without  HPF  enabled.  Table  A.4  lists  the  sequential  timing 
results. 

Getting  good  results  from  HPF  was  almost  impossible.  Data  distribution  direc¬ 
tives  should  affect  only  communication  costs,  not  correctness.  However,  the  directives 
did  determine  if  the  code  would  even  execute  or  not.  Each  distribution  performed 
properly  when  using  one  processor  (executed  with  -single).  Therefore,  the  following 
tables  do  list  the  times  for  one  processor  as  well.  Table  A. 5  lists  the  results  for  the 
row  decomposition.  Errors  are  noted  in  the  table.  Overflow  means  a  floating-point 
overflow  occurred.  Exit  means  the  program  terminated  prematurely  with  no  warning 
or  error  message  posted.  EOF  reflects  a  processor  that  received  an  unexpected  end- 
of-file  from  another  processor.  Read  errors  indicate  an  error  occurred  while  trying  to 
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Figure  8:  MPI  1-D  Column  Decomposition  Speedup. 


Figure  9:  MPI  1-D  Row  Decomposition  EflSciency. 

read  from  a  processor.  Host  indicates  one  of  the  hosts  timed-out  or  died.  Table  A.6 
lists  the  results  for  the  column  decomposition. 

Why  exactly  there  were  so  many  problems  with  HPF  remains  a  mystery.  In  cases 
where  the  code  did  perform,  it  did  not  perform  well.  There  are  several  possible  ex¬ 
planations.  There  could  be  too  much  overhead.  This  could  be  because  of  the  way 
the  single-program,  multiple-data  (SPMD)  programs  are  created.  Also,  it  might  be 
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MPI  Efficiency  1-D  Column  Decomposition 


Figure  10:  MPI  1-D  Column  Decomposition  Efficiency. 

that  there  is  not  enough  work  in  the  parallel  regions.  The  cost  of  repeatedly  calling 
the  FORALL  directive  may  be  too  high  for  the  runtime  system.  Another  possible 
reason  is  poor  communication;  possibly  due  to  a  poor  communication  library  or  un¬ 
optimized  communication  code.  Unfortunately,  in  many  cases  it  seemed  that  the 
compiler  simply  gave  up  and  did  not  generate  any  communications  upon  reaching 
unsupported  directives.  The  single  processor  code  with  parallel  HPF  directives  ac¬ 
tivated  was  profiled.  The  results  are  given  in  Table  A. 7.  While  tracing  down  exact 
problems  was  difficult,  it  is  evident  from  the  overhead  time  that  this  was  the  major 
cause  of  degraded  parallel  performance. 


6  Conclusion. 

The  experiments  and  testing  did  not  entirely  support  the  hypotheses.  Indeed,  1-D 
row  decomposition  in  MPI  was  faster  than  its  column  decomposition  counterpart. 
This  is  probably  because  of  the  way  MPI  performs  the  derived  data  type.  Copies  to 
buffers  to  implement  the  strided  column  send  will  easily  account  for  this  difference. 
There  is  not  enough  good  experimental  data  to  comment  on  the  HPF  distributions. 
During  implementation  of  the  algorithm  in  HPF,  it  was  realized  that  the  hypothesis 
that  column  decomposition  would  be  faster  than  row  decomposition  may  be  incorrect. 
Just  as  with  MPI,  the  decomposition  by  columns  does  pose  complications  for  HPF. 
When  the  data  is  decomposed  by  rows,  each  processor  has  all  of  the  image  data  that  is 
required  of  it.  By  columns,  however,  boundary  conditions  will  have  a  greater  impact, 
since  the  image  data  held  by  one  processor  might  be  required  by  its  neighbor. 

As  far  as  there  being  little  difference  between  the  fastest  MPI  implementation 
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and  the  fastest  HPF  implementation,  the  results  very  clearly  show  that  this  is  not 
the  case.  Comparisons  between  MPI  and  HPF  prove  almost  impossible  given  the 
very  poor  performance  of  HPF.  It  is  interesting  to  note,  however,  the  much  faster 
processing  times  achieved  sequentially  by  FORTRAN  90  compared  to  C. 

The  efficiencies  achieved  by  MPI  were  quite  good.  Prior  experience  with  message¬ 
passing  environments,  such  as  C-Linda,  and  indeed  some  shared  memory  parallel 
systems  such  as  the  SGI,  seemed  to  indicate  a  threshold  of  about  80-85%  efficiency 
that  could  be  expected  of  this  type  of  code  with  multiple  messages  being  passed 
at  boundary  conditions.  While  not  tightly  synchronized,  the  processors  do  have  to 
operate  in  a  lock-step  type  fashion,  since  updated  boundary  data  must  be  computed 
and  communicated  at  each  iteration.  Efficiencies  around  94%  in  these  cases  indicate 
that  MPI  has  been  well  thought  out  and  implemented  on  this  architecture. 

The  HPF  system  cannot  currently  compare  to  the  MPI  system  for  robustness 
and  speed  when  operating  on  the  DEC  Alpha  system.  This  seems  to  be  the  case,  in 
general,  as  MPI  is  more  quickly  gaining  acceptance  as  the  message-passing  system 
to  use  in  cluster  environments.  MPI  is  a  very  good  performer  and  should  only  get 
better  as  incremental  changes  and  enhancements  are  made  in  the  communication 
library.  HPF  will  undoubtedly  do  better  on  stable,  large  array  computations.  This 
is  usually  the  case  with  data  parallel  languages.  HPF  can  improve  dramatically  by 
simply  issuing  better  warnings  and  errors  at  both  compile  and  run  time.  The  HPF 
compiler  will  have  to  undergo  major  enhancements  in  its  code  generation  section  to 
be  considered  useful. 
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A  Tables 

A.l  C  Sequential  Results.  Times  Are  Listed  in  Seconds. 


Image 

Size 

Time 

to 

o 

o 

X 

200 

81.87 

400 

X 

400 

339.48 

600 

X 

600 

765.22 

800 

X 

800 

1365.54 

A. 2  MPI  Row  Decomposition  Results.  Times  Are  Listed  in 
Seconds. 


Image  Size 

Number  Processors 

2 

4 

6 

8 

200  X  200 

42.67 

22.56 

15.38 

12.31 

400  X  400 

175.44 

90.65 

60.24 

45.77 

600  X  600 

397.79 

202.45 

136.77 

101.83 

800  X  800 

711.12 

361.91 

242.29 

185.77 

A. 3  MPI  Column  Decomposition  Results.  Times  Are  Listed 
in  Seconds. 


Image  Size 

Number  Processors 

2 

4 

6 

8 

200  X  200 

48.39 

26.18 

18.19 

14.38 

400  X  400 

212.68 

102.88 

68.55 

52.77 

600  X  600 

446.54 

232.01 

155.99 

118.43 

800  X  800 

792.14 

409.80 

276.43 

208.44 
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A.4  FORTRAN  Sequential  Results.  Times  Are  Listed  in 
Seconds. 


Image  Size 

Time 

200  X  200 

400  X  400 
600  X  600 

33.03 

167.56 

392.93 

A.5  HPF  (BLOCK,  *)  Distribution  (Rows).  Times  Are  Listed 
in  Seconds. 


Image  Size 

Number  Processors 

1 

2 

4 

6 

8 

200  X  200 

75.59 

overflow 

overflow 

overflow 

overflow 

400  X  400 

317.98 

exit 

EOF 

overflow 

read 

600  X  600 

734.99 

exit 

EOF 

EOF 

EOF 

A.6  HPF  (*,  BLOCK)  Distribution  (Columns).  Times  Are 
Listed  in  Seconds. 


Image  Size 

Number  Processors 

1 

2 

4 

6 

8 

200  X  200 

77.70 

416.89 

528.35 

host 

400  X  400 

317.98 

exit 

exit 

exit 

host 

600  X  600 

734.99 

exit 

EOF 

EOF 

EOF 
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A. 7  Single  Processor  HPF  Profiling  Results. 


rnn-time  statistics 

( 

peer  0 
value  ) 

(value 

minimum 

•/.skew 

peer) 

(value 

maximum 

•/.skew 

peer) 

Timing  Info,  (sec) 
elapsed  time 

1008.23 

1008.23 

0.0 

0 

1008.23 

0.0 

0 

profiling  time 

251.50 

251.50 

0.0 

0 

251.50 

0.0 

0 

compute  time 

251.50 

251.50 

0.0 

0 

251.50 

0.0 

0 

comm,  time 

0.00 

0.00 

0.0 

0 

0.00 

0.0 

0 

active  time 

0.00 

0.00 

0.0 

0 

0.00 

0.0 

0 

idle  time 

0.00 

0.00 

0.0 

0 

0.00 

0.0 

0 

overhead  time 

779.34 

779.34 

0.0 

0 

779.34 

0.0 

0 

user  time 

705.32 

705.32 

0.0 

0 

705.32 

0.0 

0 

system  time 

298.34 

298.34 

0.0 

0 

298.34 

0.0 

0. 
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INTENTIONALLY  LEFT  BLANK. 
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B  C  sequential  code 


#iziclude  <aath.h> 

♦include  <stdio.h> 

♦include  <sys/tiine.h> 

♦include  <mpi.h> 

♦define  MAX.ROWS  1050 
♦define  MAX^COLS  1050 
♦define  D.MIN  0 
♦define  D^MAX  6 
♦define  LAMBDA  5 
♦define  STARTING.TEMP  100.0 
♦define  ENDING.TEMP  1.0 
♦define  TEMP.REDUCTION  0.1 
♦define  LATTICE.SCANS  10 

/*  Define  veurious  macros  and  functions  to  inline  code.  */ 

♦define  RANDOM.DISPARITY  (randO  7,  D.MAX) 

♦define  RAND OM.PROB ABILITY  (rand()  7,  32767  /  32767.0) 
♦define  min(x,y)  ((x)  <  (y))  ?  (x)  :  (y) 

♦define  max(x,y)  ((x)  >  (y))  ?  (x)  :  (y) 

typedef  unsigned  char  Pixel; 
enum  {LEFT  =  0,  RIGHT>; 

/♦  Global  variables:  ♦/ 

Pixel  leftImageCMAX„ROWS][MAX.COLS]; 

Pixel  rightImageCMAX.ROWS] [MAX.COLS] ; 
int  disparityMapCMAX.ROWS] [MAX.COLS] ; 
int  tempDisparityMapCMAX.ROWS] [MAX.COLS]  ; 

/♦  Function  prototypes:  */ 

/*  Basic  file  I/O  routines.  ♦/ 

s tat i c  int  ReadLeft  Image ( int  n) ; 

static  int  ReadRight Image (int  n)  ; 

static  void  WriteResultGrayScaleHapCint  n) ; 

/*  Simulated  annealing  functions.  */ 

static  int  RandomNevState(int  oldState); 
static  void  Anneal  (int  n) ; 

void  main  (int  argc,  char  ♦argvD); 


File  I/O  routines  follow. 

Static  int  ReadLeft Image (int  n)  { 

FILE  *theFile; 
char  fileName[80] : 
int  i,  j; 

/*  Attempt  to  read  the  left  image.  ♦/ 
printf ("Reading  left  image. \n"); 
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sprintf  (f  ileName ,  ”Lef  tDot /Cd .  c " ,  n)  ; 

theFile  =  f open (f ileName,  ”r”); 

if  (theFile  ==  NULL)  { 

perrorC'ReadLeft Image:  f open") ; 
return (FALSE) ; 

> 

for  (i  =  0;  i  <  n;  i++) 
for  (j  =  0;  j  <  n;  j++) 

f  scanf  (theFile ,  '7,c'' ,  &(leftlmage  [i]  Cj3  ) ) ; 

f close (theFile) ; 

ret\im(TRUE) ; 

}  /*  end  ReadLeft Image  */ 


static  int  ReadRight Image  (int  n)  i 
FILE  ♦theFile; 
char  fileName[100] ; 
int  i,  j; 

/♦  Attempt  to  read  the  right  image.  ♦/ 

pr intf  (  "Reading  right  image .  \n"  ) ; 

sprintf  (f  ileName ,  “RightDot7,d .  c" ,  n)  ; 

theFile  =  f open (f ileName,  "r"); 

if  (theFile  ==  NULL)  { 

perror( "ReadRight Image:  fopen"); 
return  (FALSE) ; 

} 

for  (i  =  0;  i  <  n;  i++) 
for  (j  =  0;  j  <  n;  j++) 

f  scanf  (theFile ,  “7,c" ,  ft  (right  Image  [i]  [j]  )  ) ; 

fclose(theFile) ; 

return (TRUE) ; 

}  /♦  end  ReadRight Image  ♦/ 


static  void  HriteResultGrayScaleMap(int  n)  •( 
FILE  ♦theFile; 
char  f ileName [80] ; 
int  i,  j; 

sprintf  (f ileName,  "C.SEQ.RESULTSJ/d" .  n)  ; 

theFile  =  f open (f ileName,  "w")  ; 

if  (theFile  ==  NULL)  { 

perror ( " Wr iteResult :  fopen" ) ; 
exit(l); 

} 


for  (i  -  0;  i  <  n;  i++) 
for  (j  =  0;  j  <  n;  j++) 
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fprintf (theFile,  "%c",  (Pixel) (disparityMapCi]  [j]  ♦  (255  /D_MAX))); 

if  (f close (theFile)  !=  0) 

perrorC'WriteResult :  f close”) ; 

}  /*  end  WriteResultGrayScaleMap  */ 


Stereo  matching  routines  follow. 

4c  }(c9tC)|c««9tc9|c]|c9t(:K4t;|(4e9tc  :tc*4c*)|c  He  ifc3|c«:(c«:(ci)c*j|e4;«>tcDc*4:*«>|c*>fe*i|c3K:(c3tci)c9|e  :4c  jfc/ 

Static  int  RandoinNevState(int  oldState) 

{ 

int  randomNumber; 

if  (RANDOM.DISPARITY  >  ((D.MAX  /  2)  -  1)) 
randomNumber  =  -1; 
else 

randomNumber  =  1; 

oldState  =  oldState  +  randomNumber; 

if  (oldState  <  D.MIN) 

OldState  =  D^MIN; 
else  if  (oldState  >  D_MAX) 

OldState  =  D_MAX; 

return  (oldState) 

}  /*  end  RandomNevState  */ 


static  void  RandomizeSystem(int  n)  •( 
int  i,  j; 

/*  Assign  a  random  disparity  to  each  disparity  grid  point.  Edges  get 
the  value  0.  ♦/ 

for  (i  =  0;  i  <  n;  i++) 

for  (j  =  0;  j  <  n;  j++)  { 

if  ((i  ==  0)  II  (i  ==  (n  -  D)  11  (j  ==  0)  II  (j  >=  (n  -  D.MAX))) 
disparityMapCi]  [j]  =0; 
else 

disparityMapCi]  [j]  =  RANDOM.DISPARITY; 

} 

>  /♦  end  RandomizeSystem  ♦/ 


static  int  Energy  (int  row,  int  col,  int  disparity)  { 
int  delta; 
int  photometric; 

delta  =  abs (disparity  -  disparityMap Crow- 1] [col- 1]  )  + 
abs (disparity  -  dispzurityMap Crow- 1] [col])  + 
abs (disparity  -  disparityMapCrow-1] [col+l] )  + 
abs (disparity  -  disparityMapCrow] [col-1]  )  + 
abs (disparity  -  dispar ityMap [row] [col+l]  )  + 
abs  (disparity  -  dispeurityMapCrow+l]  [col-1])  + 
abs (disparity  -  disparityMap[row+l] [col]  )  + 
abs (disparity  -  disparityMapCrow+1] [col+l]) ; 

photometric  =  abs (left Image [row] [col+dispar it y]  -  rightImageCrow] [col]) ; 
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return (photometric  +  (LAMBDA  ♦  delta)); 
}  /♦  end  Energy  ♦/ 


static  void  AnneaKint  n)  { 

int  newState,  newEnergy,  oldEnergy,  i,  j; 
double  currentTemp; 
int  scanCounter; 
int  deltaEnergy; 

RandomizeSystem(n) ; 

currentTemp  =  STARTING^TEMP ; 

while  (currentTemp  >=  ENDING.TEMP)  { 

printf ("Temp  =  ’/.fXn",  currentTemp); 

for  (scanCounter  =  0;  scanCounter  <  LATTICE. SCANS;  scanCounter++)  *C 
for  (i  =  1;  i  <  (n  -  1) ;  i++) 

for  (j  =  1;  3  <  (n  -  D.MAX);  j++)  -C 

newState  =  RandomNewState(disparityMapCi3  [j])  ; 
oldEnergy  =  Energy (i,  j,  disparityMapCi] [j] ) ; 
newEnergy  =  Energy (i,  j,  newState); 
deltaEnergy  =  newEnergy  -  oldEnergy; 
if  (deltaEnergy  <  0) 

tempDisparityMapCi] [j]  =  newState; 
else  if  (RANDOM.PROBABILITY  < 

exp( (double) -deltaEnergy/currentTemp) ) 
tempDispaurityMapCi]  Cj]  =  newState; 
else 

tempDisparityMapCi]  [j]  =  disparityMapCi]  Ej]  ; 

> 


for  (i  =  1;  i  <  (n  -  1);  i++) 

for  (j  =  1;  j  <  (n  -  D.MAX);  j++) 

disparityMapCi] Cj]  =  tempDisparityMapCi] Cj] ; 


> 

currentTemp  -=  (currentTemp  ♦  TEMP .REDUCTION) ; 

> 

>  /*  end  Anneal  */ 


void  main(int  argc,  chao:  *argvC]) 
int  n,  i; 

struct  timespec  tl,  t2; 
double  startTime,  endTime; 


MPI_Init(fcargc,  ftargv) ; 
n  =  atoi(argvCl]) ; 

if  (ReadLef t Image  (n)  tt  ReadRightlmage(n))  < 
getclock(TIMEOFDAY.  ktl) ; 

StartTime  =  (tl.tv.sec  *  100.0)  +  (tl.tv.nsec  ♦  lOe-7); 
StartTime  =  MPI.WtimeO; 

Anneal (n) ; 

getclock(TIME0FDAY,  fct2) ; 

endTime  -  (t2.tv.sec  ♦  100.0)  +  (t2.tv.nsec  ♦  lOe-7) ; 
printf ("mpi  */lf\n",  MPI.WtimeO  -  startTime); 

> 


26 


printf ("Elapsed  time  =  %.2f  seconds\n",  (endTime  -  startTime)  /  100.0) 
WriteResultGrayScaleMapCn) ; 

MPI.FinalizeO : 

}  /♦  end  main  */ 
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C  MPI  1"D  row  decomposition. 


tinclude  <stdio.h> 
tinclude  <mpi.h> 
tinclude  <math.h> 
tinclude  <sys/time.h> 

tdefine  MAX.ROWS  1000 
tdefine  MAX.COLS  1000 
tdefine  D_MIN  0 
tdefine  D.MAX  6 
tdefine  LAMBDA  5 
tdefine  STARTING.TEMP  100.0 
tdefine  ENDING.TEMP  1.0 
tdefine  TEMP.REDUCTION  0.1 
tdefine  LATTICE.SCANS  10 

/♦  Define  various  macros  and  functions  to  inline  code.  */ 

tdefine  RANDOM.DISPARITY  (randO  */.  D.MAX) 
tdefine  RANDOM.PROBABILITY  (randO  %  32767  /  32767.0) 
tdefine  min(x,y)  ((x)  <  (y))  ?  (x)  :  (y) 
tdefine  max(x,y)  ((x)  >  (y))  ?  (x)  :  (y) 

typedef  unsigned  char  Pixel; 
enum  {LEFT  =  0,  RIGHT}; 

/♦  Global  variables:  ♦/ 

Pixel  leftImage[MAX_R0WS3 [MAX.COLS] ; 

Pixel  right Image [MAX.ROWS] [MAX.COLS] ; 
int  disparityMapCMAX.ROWS]  [MAX.COLS]  ; 
int  tempDisparityMap[MAX.ROWS]CMAX.COLS]; 

/*  Function  prototypes:  »/ 

/♦  Basic  file  I/O  routines.  */ 

static  int  ReadLeft Image (int  n) ; 

static  int  ReadRight Image (int  n)  ; 

static  void  WriteResult6rayScaleMap(int  n); 

/*  MPI  communication  routines.  */ 

static  void  GetImageDataFromMaster(int  n,  int  my  ID,  int  startRow,  int  stopRow) ; 
static  void  SendResultsToMaster(int  n,  int  my ID,  int  startRow,  int  stopRow) ; 
static  void  CollectResults(int  n,  int  numProcs,  int  chunkSize); 

/*  Simulated  annealing  fimctions.  ♦/ 

static  int  RandomNevState(int  oldState); 

static  void  Anneal  (int  n,  int  mylD,  int  startRow,  int  stopRow,  int  numProcs); 
void  main(int  argc,  char  *argvn); 


File  I/O  routines  follow. 

static  int  ReadLeft  Image  (int  n)  { 
FILE  ♦theFile; 
char  f ileName[80] ; 
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int  i,  j; 

/*  Attempt  to  read  the  left  image.  */ 

sprintf (f ileName,  "LeftDotXd.c” ,  n); 

theFile  =  f open (f ileName,  "r”); 

if  (theFile  ==  NULL)  { 

perrorC'ReadLeft Image :  f open") ; 
return (FALSE) ; 

} 

for  (i  =  0;  i  <  n;  i++) 
for  (j  =  0;  j  <  n;  j++) 

fscanf  (theFile,  "iCc",  ft(leftImageCi]  [j] ))  ; 

f close (theFile) ; 

return (TRUE) ; 

>  /♦  end  ReadLeft Image  */ 


static  int  ReadRightImage(int  n)  { 

FILE  ★theFile; 
char  fileNameClOO] ; 
int  i,  j; 

/★  Attempt  to  read  the  right  image.  ★/ 

sprintf (f ileName ,  "RightDotXd . c" ,  n) ; 

theFile  =  fopen(f ileName,  "r") ; 

if  (theFile  ==  NULL)  { 

perror  (  "ReadRight Image :  f open"  ) ; 
return  (FALSE) ; 

} 

for  (i  =  0;  i  <  n;  i++) 
for  (j  =  0;  j  <  n;  j++) 

fscanf (theFile ,  "%c" ,  &(rightlmage [i] [j] ) ) ; 

f close (theFile) ; 

return (TRUE) ; 

>  /*  end  ReadRight Image  ♦/ 


static  void  WriteResultGrayScedeMap(iht  n)  { 
FILE  ♦theFile; 
char  f ileName [80] ; 
int  i,  j; 

sprintf  (f  ileName,  "MPI.ROWS.RESULTS J/,d" ,  n)  ; 

theFile  =  fopen(f ileName,  "w"); 

if  (theFile  -=  NULL)  < 

perror ( "Writ eRe suit :  fopen"); 
exit(l) ; 

> 
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for  (i  =  0;  i  <  n;  i++) 
for  (j  =  0;  j  <  n;  j++) 

fprintfCtheFile,  '7.c",  (Pixel) (disparityMap [i]  [j]  *  (255  /D_MAX))); 
fclose(theFile) ; 

>  /*  end  WriteResultGrayScaleMap  ♦/ 


MPI  communication  routines  follow. 

********^**:^*******^*t*****ilF*^*******n(****************ilfHi******;¥********^***/ 

Static  void  CollectResults(int  n,  int  numProcs,  int  chunkSize)  ■( 
int  i,  source; 

MPI.Status  status ; 

for  (i  =  chunkSize;  i  <  n;  i++)  { 

source  =  min(i  /  chunkSize,  numProcs  -  1); 

MPI„Recv(disparityMap[i] ,  n,  MPI.INT,  source,  source,  MPI.COMM. WORLD , 
fcstatus) ; 

> 

>  /*  end  CollectResults  ♦/ 


static  void  GetImageDataFromMaster(int  n,  int  my  ID,  int  startRow,  int  stopRow)  { 
MPI.Status  status; 
int  i; 

for  (i  =  StartRow;  i  <=  stopRow;  i++)  { 

MPI.Recv(leftImage[i],  n,  MPI.UNSIGNED.CHAR,  0,  LEFT,  MPI_COMM_WORLD , 
festatus) ; 

MPI.Recv(rightImage[i],  n,  MPI.UNSIGNED.CHAR,  0,  RIGHT,  MPI^COMM. WORLD, 
^status) ; 

> 

}  /♦  end  GetImageDataFromMaster  */ 


static  void  SendResultsToMaster(int  n,  int  my  ID,  int  startRow,  int  stopRow)  { 
int  i ; 

for  (i  =  StartRow;  i  <=  stopRow;  i++) 

MPI^Send (disparityMap [i] ,  n,  MPI^INT,  0,  mylD,  MPI.C0MM_W0RLD) ; 

}  /♦  end  SendResultsToMaster  ♦/ 


Stereo  matching  routines  follow. 

static  int  RandomNewState(int  oldState) 
int  randomNumber; 

if  (RANDDM.DISPARITY  >  ((D.MAX  /  2)  -  1)) 
randomNumber  =  -1; 
else 

randomNumber  =  1; 

oldState  =  oldState  +  randomNumber; 
if  (oldState  <  D.MIN) 
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oldState  =  D.MIN; 
else  if  (oldState  >  D.MAX) 
oldState  =  D.MAX; 

return (oldState) ; 

>  /*  end  RandomNewState  ♦/ 


static  void  RandomizeSystem(int  startRow,  int  stopRow,  int  n)  { 
int  i,  j; 

/*  Assign  a  random  disparity  to  each  disparity  grid  point.  Edges  get 
the  value  0.  ♦/ 

for  (i  =  StartRow;  i  <=  stopRow;  i++) 
for  (j  =  0;  j  <  n;  j++)  { 

if  ((i  ==  0)  II  (i  ==  (n  -  D)  II  (j  ==  0)  II  (j  >=  (n  -  D.MAX))) 
disparityMapEi]  Cj]  =  0; 
else 

disparityMapEi]  Cj]  =  RANDOM.DISPARITY; 

} 

}  /*  end  RandomizeSystem  ♦/ 


static  int  Energy (int  row,  int  col,  int  disparity)  { 
int  delta; 
int  photometric; 

delta  =  abs(disparity  -  dispar ityMapCrow-1] [col- 1])  + 
abs (disparity  -  disparityMap [row-1] [col] )  + 
abs(disparity  -  dispsurityMapCrow-l] [col+1])  + 
abs(disparity  -  disparityMap [row] [col- 1] )  + 
abs  (disparity  -  disp2LrityMap[row]  [col+1]  )  + 
abs(disparity  -  disparityMap [row+1] [col- 1])  + 
abs (disparity  -  disparityMap [row+1] [col] )  + 
abs (disparity  -  disparityMap [row+l] [col+1]) ; 

photometric  =  abs(leftImage[row] [col+disparity]  -  r ight Image [row] [col]  ) ; 

return (photometric  +  (LAMBDA  *  delta)); 

}  /*  end  Energy  ♦/ 


static  void  Anneal  (int  n,  int  mylD,  int  startRow,  int  stopRow,  int  numProcs)  •{ 
int  newState,  newEnergy,  oldEnergy,  i,  j,  start,  stop; 
double  currentTemp; 
int  scanCounter;  , 
int  deltaEnergy; 

MPI.Status  status; 
struct  timespec  tl,  t2; 
double  startTime,  endTime; 

start  =  StartRow; 
stop  =  StopRow; 

if  (mylD  ==  0) 

start  =  StartRow  +  1; 
else  if  (my ID  ==  (numProcs  -  1)) 
stop  =  StopRow  -  1; 

RandomizeSystem(startRow,  stopRow,  n); 
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currentTemp  =  STARTING.TEMP ; 
while  (currentTemp  >=  ENDING.TEMP)  •( 

if  (mylD  ==  0) 

printf(”Temp  =  7,f\n",  currentTemp); 

for  (scanCounter  =  0;  scanCounter  <  LATTICE_SCANS ;  scanCounter++)  { 

/♦  Send  my  disparity  values  to  my  neighbors  and  get  values  in.  */ 

if  (mylD  “  0)  { 

MPI.Send(disparityMap[stopRow] ,  n,  MPI.INT,  my  ID  +  1,  my  ID, 
MPI.COMM.WORLD) ; 

MPI.Recv (dispar ityMap [stopRow+l]  ,  n,  MPI.INT,  mylD  +  1,  mylD  +  1, 
MPI.COMM.WORLD,  ftstatus) ; 

} 

else  if  (mylD  ==  numProcs  -  1)  { 

MPI_Send(disparityMapCstartRow] ,  n,  MPI.INT,  my  ID  -  1,  mylD, 
MPI.COMM^WORLD) ; 

MPI_Recv(disparityMapCstartRow-l]  ,  n,  MPI_INT,  mylD  -  1,  mylD  -  1 
MPI.COMM.WORLD,  fcstatus); 

} 

else  { 

MPI_Send(disparityMap[startRow3 ,  n,  MPI.INT,  my  ID  -  1,  my  ID, 
MPI^C0MM.W0RLD) ; 

MPI_Send(disparityMapCstopRow] ,  n,  MPI_INT,  my  ID  +  1,  my  ID, 
MPI^C0MM.W0RLD) ; 

MPI.Recv(disparityMap[startRow-l] ,  n,  MPI.INT,  my  ID  -  1,  mylD  -  1 
MPI^COMM^WORLD,  fcstatus); 

MPI.Recv(disparityMap[stopRow+l3  ,  n,  MPI.INT,  my  ID  +  1,  mylD  +  1, 
MPI_COMM.WORLD,  fcstatus); 

} 

/♦  getclock(TIMEOFDAY,  fctl); 

startTime  =  (tl.tv.sec  ♦  100.0)  +  (tl,tv_nsec  *  lOe-7) ;  */ 

for  (i  =  start;  i  <=  stop;  i++) 

for  (j  =  1;  j  <  (n  -  D.MAX) ;  j++)  { 

newState  =  RandomNewState  (disparityMap  [i]  [j  ]  )  ; 
oldEnergy  =  Energy(i,  j,  disparityMap [i] Ej] ) ; 
newEnergy  =  Energy (i,  j,  newState); 
deltaEnergy  =  newEnergy  -  oldEnergy; 
if  (deltaEnergy  <  0) 

tempDisparityMapCi] [j]  =  newState; 
else  if  (RANDOM^PROBABILITY  < 

exp ( (double) -deltaEnergy/ currentTemp) ) 
tempDispeirityMapCi]  [j]  =  newState; 
else 

tempDisparityMapCi]  [j]  =  disparityMap [i]  [j]  ; 

> 


for  (i  =  start;  i  <=  stop;  i++) 

for  (j  =  1;  j  <  (n  -  D.MAX);  j++) 

disparityMap [i] [j]  =  tempDisparityMapCi] Cj] ; 

/♦ 

getclock(TIME0FDAY,  fct2) ; 

endTime  «  (t2.tv.sec  ♦  100.0)  +  (t2.tv.nsec  ♦  lOe-7) ; 
if  (my ID  ==  1) 

printf ("Elapsed  time  =  %.2f  seconds\n*',  (endTime  -  startTime)  /  100.0); 
♦/ 


} 

currentTemp  -=  (currentTemp  *  TEMP.REDUCTION) ; 

} 

/*  If  I  am  not  the  boss,  send  my  results  back  to  the  boss.  ♦/ 
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if  (my ID  !=  0) 

SendResultsToMasterCn,  mylD,  startRov,  stopRow) ; 
}  /*“  end  Anneal  */ 


void  mainCint  aurgc,  cheu:  *2u:gv[])  { 
int  nunProcs,  my ID,  n; 
int  startRow,  stopRov; 
int  chunkSize,  destination; 
int  i; 
double  tl; 

MPI_Init(&argc,  ftargv); 

MPI.Comm.size(MPI.COMM_WORLD,  fcnumProcs) ; 

MPI.Comm.rankCMPI.COMM.WORLD,  fanylD); 

n  =  atoi(argv[l]) ; 

/*  Determine  the  start  and  stop  rows.  ♦/ 

chunks ize  =  (n  /  numProcs) ; 
startRow  =  (chunkSize  ♦  my  ID); 
stopRow  =  (startRow  +  chunkSize)  -  1; 
if  (my ID  ==  numProcs  -  1) 

StopRow  =  max (stopRow,  n-1) ; 

if  (mylD  ==  0)  i 

/*  Read  in  the  left  and  right  image  files.  ♦/ 

if  (Re adLeft Image (n)  ReadRightlmage(n))  { 

/♦  Send  the  image  data  to  the  different  processors.  Processor  0 
gets  the  first  part  of  the  image.  */ 

for  (i  =  chunkSize;  i  <  n;  i++)  i 

destination  =  min(i  /  chunkSize,  numProcs  -  1); 
MPI.Send(leftImage[i] ,  n,  MPI.UNSIGNED.CHAR,  destination,  LEFT, 
MPI^COMM.WORLD) ; 

MPI„Send(rightImageCi3 ,  n,  MPI.UNSI6NED.CHAR,  destination, 
RIGHT,  MPI.COMM.WORLD) ; 

} 

> 

tl  =  MPI.WtimeO; 

AnneaKn,  mylD,  startRow,  stopRow,  numProcs); 

printf ("Total  time  for  row  decomposition  %d  =  y,.2f  seconds\n'',  n, 
MPI.WtimeO  -  tl); 

/♦  Collect  results  from  the  workers.  */ 

CollectResults(n,  niunProcs,  chunkSize); 

VriteResultGrayScaleMap(n) ; 

} 

else  { 

/*  Gather  information  from  the  boss.  ♦/ 

GetImageDataFromMaster(n,  my  ID,  startRow,  stopRow); 

AnneaKn,  mylD,  startRow,  stopRow,  numProcs); 
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> 

MPI.FinalizeC) ; 

}  /*  end  main  */ 
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INTENTIONALLY  LEFT  BLANK. 
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D  MPI  1-D  column  decomposition 


tinclude  <stdio,h> 
tinclude  <mpi.h> 
tinclude  <math.h> 

♦define  MAX^ROWS  1000 
♦define  MAX.COLS  1000 
♦define  D.MIN  0 
♦define  D.MAX  6 
♦define  LAMBDA  5 
♦define  STARTING.TEMP  100.0 
♦define  ENDING.TEMP  1.0 
♦define  TEMP ^REDUCTION  0.1 
♦define  LATTICE.SCANS  10 

/♦  Define  various  macros  and  functions  to  inline  code.  ♦/ 

♦define  RANDOM.DISPARITY  (randO  7.  D.MAX) 

♦define  RANDOM.PROBABILITY  (randO  7.  32767  /  32767.0) 

♦define  min(x,y)  ((x)  <  (y))  ?  (x)  :  (y) 

♦define  max(x,y)  (Cx)  >  (y))  ?  (x)  :  (y) 

typedef  unsigned  char  Pixel; 
enum  <LEFT  =  0,  RIGHT}; 

/*  Global  variables:  ♦/ 

Pixel  leftImageCMAX_ROWS][MAX_COLS]; 

Pixel  rightImageCMAX.ROWS][MAX_COLS]; 
int  disparityMap[MAX.ROWS][MAX^COLS]; 
int  tempDisparityMapCMAX.ROWS]  [MAX.COLS]  ; 

/♦  Function  prototypes:  */ 

/*  Basic  file  I/O  routines.  */ 

static  int  ReadLeft Image (int  n)  ; 

static  int  ReadRightlxoageCint  n); 

static  void  WriteResultGrayScaleMapCint  n) ; 

/♦  MPI  communication  routines.  ♦/ 

static  void  GetImageDataFromMasterCint  n,  int  mylD,  int  numProcs,  int  startCol,  int  stopCol) ; 
static  void  SendResultsToMaster(int  n,  int  my  ID,  int  startCol,  int  stopCol) ; 
static  void  Collect Re suits (int  n,  int  numProcs,  int  chunkSize) ; 

/♦  Simulated  annee^ing  functions.  */ 

static  int  RandomNevState(int  oldState) ; 

static  void  AnneaKint  n,  int  my  ID,  int  startCol,  int  stopCol,  int  numProcs); 
void  main(int  aurgc,  cheir  ♦argvC]); 


File  I/O  routines  follow. 

static  int  ReadLeft Image (int  n)  i 
FILE  *theFile ; 
char  fileName[80] ; 
int  i,  j; 
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/♦  Attempt  to  read  the  left  image.  */ 

sprin^f (f ileName ,  "LeftDotXd . c” ,  n) ; 

theFile  =  f open (f ileName,  “r"); 

if  (theFile  —  NULL)  { 

perrorC'ReadLeft Image :  f  open”)  ; 
ret\im(FALSE) ; 

> 

for  (i  =  0;  i  <  n;  i++) 
for  (j  =  0;  j  <  n;  j++) 

f  scanf  (theFile ,  ”7,c” ,  &(lef timage  [i]  [j]  ) )  ; 

f close (theFile) ; 

return (TRUE) ; 

y  /*  end  ReadLef timage  */ 


static  int  ReadRight Image  (int  n)  {, 

FILE  *theFile; 
char  f ileName [100]; 
int  i,  j; 

/♦  Attempt  to  read  the  right  image.  ♦/ 

sprintf(f ileName,  "RightDotXd.c",  n); 

theFile  =  f open (f ileName,  "r"); 

if  (theFile  —  NULL)  { 

perror( "ReadRight Image :  f open" ) ; 
return (FALSE) ; 

} 

for  (i  =  0;  i  <  n;  i++) 
for  (j  =  0;  j  <  n;  j++) 

f  scanf  (theFile ,  "y,c" ,  ft  (right  Image  [i]  [j]  )) ; 

f close (theFile) ; 

return (TRUE) ; 

}  /*  end  ReadRight Image  */ 


static  void  WriteResultGrayScaleMap(int  n)  *[ 
FILE  ♦theFile; 
cheir  f  ileName  [80]  ; 
int  i,  j; 

sprintf(f ileName,  ”MPI.C0LS.RESULTS.y.d" ,  n)  ; 

theFile  =  f open (f ileName,  "w") ; 

if  (theFile  ==  NULL)  { 

perrorC’WriteResult :  f open") ; 
exit(l); 

} 


for  (i  =  0;  i  <  n;  i++) 
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for  (j  =  0;  j  <  n;  j++) 

fprintf  (theFile,  (Pixel) (disparityMap[i]  [j]  *  (255  /D.MAX))); 

fclos€j(theFile) ; 

}  /*  end  WriteResultGrayScaleMap  */ 


MPI  conmimication  routines  follow. 

4c  itc  sterile 

Static  void  CollectResults(int  n,  int  numProcs,  int  chunkSize)  { 
int  i,  source; 

MPI.Status  status; 

MPI^Datatype  column; 

MPI.Type,vector(n,  1,  MAX.COLS,  MPI.INT,  fccolumn) ; 
MPI_Type.commit(fccolumn) ; 

for  (i  =  chunkSize;  i  <  n;  i++)  { 

source  =  min(i  /  chunkSize,  numProcs  -  1); 

MPI^Recv (ft (dispar ityMap [0]  [i])  ,  1,  column,  source,  source, 
MPI.COMM^WORLD,  ftstatus); 

} 

}  /*  end  CollectResults  */ 


static  void  GetImageDataFromMaster(int  n,  int  my  ID,  int  numProcs,  int  startCol,  int  stopCol)  { 
MPI.Status  status; 
int  i; 

MPI^Datatype  column; 

MPI_Type.vector(n,  1,  MAX.COLS,  MPI^UNSIGNED.CHAR,  fccolumn); 

MPI.Type_commit (fccolumn) ; 

for  (i  =  StartCol;  i  <=  stopCol;  i-H-)  { 

MPI.Recv(fc(leftImageCO]  Ci]),  1,  column,  0,  LEFT,  MPI^COMM. WORLD, 
fcstatus) ; 

MPI.Recv(fc(rightImage[0]  Ci]),  1,  column,  0,  RIGHT,  MPI.C0MM_W0RLD , 
fcstatus) ; 

> 

/♦We  have  to  receive  more  of  the  left  image  since  the  disparity  is 

by  columns  and  we  havenH  received  all  the  possible  matched  points.  ♦/ 

if  (my ID  !=  (numProcs  -  1)) 

for  (i  =  StopCol  +1;  i  <=  stopCol  +  D.MAX;  i++) 

MPI_Recv(fc(leftImageCO]  [i]) ,  1,  column,  0,  LEFT,  MPI„C0MM_W0RLD , 
fcstatus) ; 

}  /♦  end  GetImageDataFromMaster  ♦/ 


static  void  SendResultsToMaster(int  n,  int  my  ID,  int  startCol,  int  stopCol)  { 
int  i; 

MPI.Datatype  column; 

MPI.Type_vector(n,  1,  MAX.COLS,  MPI^INT,  fccolumn); 

MPI.Type_commit  (fccolumn) ; 

for  (i  =  StartCol;  i  <=  stopCol;  i++) 

MPI_Send(fc(disparityMapCO]  Ci3)  ,  1,  column,  0,  my  ID,  MPI.COMM^WORLD)  ; 
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}  /♦  end  SendResultsToMaster  ♦/ 


Stereo  matching  routines  follow. 

Static  int  RandomNevState(int  oldState) 

{ 

int  randomNumber ; 

if  (RANDOM.DISPARITY  >  ((D_MAX  /  2)  -  1)) 
randomNumber  =  -1; 
else 

randomNumber  =  1; 

oldState  ==  oldState  randomNumber; 

if  (oldState  <  D.MIN) 
oldState  =  D.MIN; 
else  if  (oldState  >  D.MAX) 
oldState  =  D.MAX; 

return (oldState) ; 

}  /*  end  RandomNewState  ♦/ 


static  void  RandomizeSystem(int  startCol,  int  stopCol,  int  n)  { 
int  i,  j; 

/*  Assign  a  random  disparity  to  each  dispeirity  grid  point.  Edges  get 
the  value  0.  ♦/ 

for  (i  =  0;  i  <  n;  i++) 

for  (j  =  StartCol;  j  <=  stopCol;  j++)  { 

if  ((i  ==0)  II  (i  ==  (n  -  D)  II  (j  =  0)  II  (j  >=  (n  -  D_MAX))) 
dispaurityMapCi]  [j]  =  0; 
else 

disparityMapCi]  [j]  =  RAND0M_DISPARITY; 

} 

}  /♦  end  RandomizeSystem  */ 


static  int  Energy (int  row, 
int  delta; 
int  photometric; 

delta  =  abs (disparity  - 
abs (disparity  - 
abs (disparity  - 
abs (disparity  - 
abs (disparity  ^ 
abs  (dispaurity  - 
abs  (disparity  - 
abs  (dispaority  - 


int  col,  int  disparity)  { 


dispar ityMap [row- 1] [col-l] )  + 
disparityMap [row-1] [col] )  + 
di spaur ityMap  [row- 1]  [col+1]  )  + 
disparityMap [row] [col-1] )  + 
dispairityMap  [row]  [col+1]  )  + 
disparityMap  [row+1]  [col-1]  )  + 
disparityMap  [row+1]  [col]  )  + 
disparityMap  [row+1]  [col+1]  ) ; 


photometric  =  abs (left Image [row] [col+disparity]  -  rightImage[row]  [col]) ; 


return (photometric  +  (LAMBDA  *  delta)); 


>  /*  end  Energy  ♦/ 
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static  void  AnnealCint  n,  int  my  ID,  int  startCol,  int  stopCol,  int  numProcs)  *C 
int  nevState,  nevEnergy,  oldEnergy,  i,  j,  start,  stop; 
double  currentTemp; 

.int  scanCounter; 
int  deltaEnergy; 

MPI^Status  status; 

MPI_Datatype  column; 

start  =  StartCol; 
stop  =  StopCol; 
if  (mylD  ==  0) 

start  =  StartCol  +1; 
else  if  (my ID  ==  (numProcs  -  1)) 

Stop  =  (n  -  D.MAX)  -  1; 

MPI„Type.vector(n,  1.  MAX.COLS,  MPI.INT,  fccolumn) ; 

MPI„Type.commit(&column) ; 

Randomi2eSystem( StartCol,  stopCol,  n) ; 

currentTemp  =  START ING.TEMP ; 

while  (currentTemp  >=  ENDING^TEMP)  *C 

if  (mylD  ==  0) 

printfC'Temp  =  %f\n*',  currentTemp); 

for  (scanCounter  =  0;  scanCounter  <  LATTICE_SCANS ;  scanCounter++)  ■( 

/*  Send  my  disparity  values  to  my  neighbors  and  get  values  in.  ♦/ 

if  (mylD  ==  0)  { 

MPI_Send(fe(disparityMap[0]  [stopCol]),  1,  column,  my  ID  +  1,  my  ID, 
MPI.COMM.WORLD) ; 

MPI_Recv(&(disp2irityMap[0]  [stopCol+1] ) ,  1,  column,  mylD  +  1, 
my ID  +  1,  MPI.COMM_WORLD,  fcstatus) ; 

> 

else  if  (mylD  -=  numProcs  -  1)  { 

MPI_Send(fc(disparityMap[0]  [startCol]) ,  1,  column,  my  ID  -  1,  my  ID, 
MPI.COMM.WORLD) ; 

MPI_Recv(fc(disparityMap[0]  [startCol-1]) ,  1,  column,  mylD  -  1, 
my ID  -  1,  MPI_COMM_WORLD,  fcstatus); 

> 

else  { 

MPI_Send(&(disparityMap[0]  [started]) ,  1,  column,  my  ID  -  1,  mylD, 
MPI.COMM.WORLD) ; 

MPI.Send  (fc(di  spar  it  yMap[0]  [stopCol])  ,  1,  column,  mylD  +  1,  mylD, 
MPI.C0MM_W0RLD) ; 

MPI.Recv (fc (dispar ityMap[0]  [start Col- 1])  ,  1,  column,  mylD  -  1, 
mylD  -  1,  MPI.COMM.WORLD,  fcstatus); 

MPI.Re cv(fc( dispar ityMap [0]  [stopCol+1])  ,  1,  column,  mylD  +  1, 

my  ID  +  1,  MPI.COMM.WORLD,  fcstatus); 

> 

for  (i  =  1;  i  <  (n  -  1);  i++) 

for  (j  =  steurt;  j  <=  stop;  j++)  { 

newState  =  RandomNewState(disparityMap[i]  [j]) ; 

OldEnergy  =  Energy(i,  j,  disparityMap[i]  [j]) ; 
newEnergy  =  Energy (i,  j,  newState); 
deltaEnergy  =  newEnergy  -  oldEnergy; 
if  (deltaEnergy  <  0) 

tempDispeLrityMap[i]  [j]  =  newState; 
else  if  (RANDOM.PROBABILITY  < 

exp ( (double) -deltaEnergy/currentTemp) ) 
tempDisparityMap[i]  [j]  =  newState; 
else 

tempDisparityMap[i]  [j]  =  disparityMap[i]  [j]  ; 


41 


> 


for  (i  =  1;  i  <  (n  -  1);  i++) 

for  (j  =  start;  j  <=  stop;  j++) 

disparityMap[i]  [j]  =  tempDisparityMapCi]  [j]  ; 


} 

current Temp  -=  (currentTemp  *  TEMP.REDUCTION) ; 

} 

/*  If  I  am  not  the  boss,  send  my  results  back  to  the  boss.  ♦/ 
if  (my ID  !=  0) 

SendResultsToMasterCn,  my  ID,  startCol,  stopCol); 

}  /*  end  Anneal  */ 


void  main(int  argc,  char  *argvC])  { 
int  numProcs,  my  ID,  n; 
int  steLTtCol,  stopCol; 
int  chunkSize,  destination; 
int  i,  j,  oldDestination; 
double  tl; 

MPI.Datatype  column; 

MPI.InitC&argc,  ftargv) ; 

MPI_Comm„size(MPI.COMM_WORLD,  ftnumProcs) ; 

MPI_Coiiim„rank(MPI.COMM_WORLD,  fanylD) ; 

n  =  atoi(argv[l]) ; 

/♦  Determine  the  start  and  stop  columns.  ♦/ 

chunkSize  =  (n  /  numProcs) ; 

StartCol  =  (chunkSize  ♦  my ID); 

StopCol  =  (startCol  +  chunkSize)  -  1; 
if  (my ID  ==  numProcs  -  1) 

StopCol  =  max(stopCol,  n-1) ; 

if  (mylD  ==  0)  { 

/*  Read  in  the  left  and  right  image  files.  ♦/ 

if  (ReadLeftlmage(n)  ReadRightlmage(n))  { 

/♦  Send  the  image  data  to  the  different  processors.  Processor  0 
gets  the  first  part  of  the  image.  ♦/ 

MPI.Type.vector(n,  1,  MAX.COLS,  MPI.UNSIGNED.CHAR,  fccolumn) ; 

MPI „Type_ commit  (ftcolumn)  ; 

oldDestination  -  1; 

for  (i  =  chunkSize;  i  <  n;  i++)  ■( 

destination  =  min(i  /  chunkSize,  numProcs  -  1); 
if  (destination  !~  oldDestination)  { 

for  (j  =  i  -  1;  j  <=  (i  -  1)  +  D^MAX;  j++) 

MPI.Send(&(leftImageCO] [i] ) ,  1,  column,  oldDestination, 
LEFT,  MPI.COMM.WORLD) ; 
oldDestination  -  destination; 

> 

MPI_Send(&(leftImageCO]  [i]) ,  1,  column,  destination,  LEFT, 
MPI.COMM.WORLD) ; 

MPI_Send(fc(rightImageCO] Ci]),  1,  column,  destination,  RIGHT, 
MPI.COMM.WORLD) ; 
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} 


} 

tl  =  MPI.WtimeC); 

AnneaKn,  mylD,  startCol,  stopCol,  ntunProcs) ; 

printf ("Total  time  for  column  decomposition  y,d  =  5C.2f  seconds\n",  n, 
MPI.WtimeC)  -  tl) ; 

/♦  Collect  results  from  the  workers.  */ 

CollectResultsCn,  numProcs,  chunkSize); 

WriteResultGrayScaleMap(n) ; 

} 

else  *C 

/*  Gather  information  from  the  boss.  */ 

GetImageDataFromMaster(n,  mylD,  numProcs,  startCol,  stopCol); 
AnneaKn,  mylD,  startCol,  stopCol,  numProcs); 

} 

MPI.FinalizeO ; 

}  /*  end  main  */ 
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E  FORTRAN  and  HPF  code. 


nodule  timer 
contains 

subroutine  print .time 

character*8  : :  date  !  ccyymmdd 

character*10  ::  time  !  hhnmss.sss 

call  date.and. time (date, time) 

print  *,  date(5:6)// W/date(7:8)// V*//date(3:4) 
print  *,  time(l:2)//’ : *//time(3:4)//* : *//time(5:10) 
return 

end  subroutine  print .time 

real  function  cputimeO 
cputime  -  secnds(O.O) 
return 

end  function  cputime 

end  module  timer 

program  rows 
use  timer 

implicit  none 

integer  MAX.ROWS,  MAX.COLS 

parameter  (MAX.ROWS  =  600,  MAX.COLS  =  600) 

integer  D.MIN,  D.MAX 

parameter  (D.MIN  =  0,  D.MAX  =  6) 

integer  LAMBDA 

parameter  (LAMBDA  =  5) 

integer  SCANS 

parameter  (SCANS  «  10) 

real  STARTING.TEMP ,  ENDING.TEMP,  TEMP.REDUCTION 
parameter  (STARTING.TEMP  =  100.0,  ENDING.TEMP  =1.0) 
parameter  (TEMP.REDUCTION  =0.1) 
double  precision  start,  stop 

integer  dMap(MAX_ROWS,  MAX.COLS) 
integer,  dimension( : , : ) ,  allocatable  ::  leftimage 
integer,  dimens ion( :,:) ,  allocatable  ::  right Image 
integer,  dimens ion( :,:) ,  allocatable  ::  nevDMap 
integer,  dimens ion( :,:) ,  allocatable  ::  oldEnergies 
integer,  dimension(: , :) ,  allocatable  ::  nevEnergies 
real,  dimens ion(: ,:) ,  allocatable  ::  randoms 
!hpf$  distribute (block,  *)  ::  dMap 
!hpf$  align  with  dMap  : :  randoms 

!hpf$  align  with  dMap  ::  nevEnergies,  oldEnergies,  nevDMap 
!hpf$  align  with  dMap  leftimage,  right Image 
real  currentTemp 

integer  i,  j,  n,  scanCounter,  k,  1 

real  dMaxReal 

interface 

pure  integer  function  NewState(oldState,  D.MIN,  D.MAX) 
integer,  intent(in)  ::  oldState,  D.MIN,  D.MAX 
end  function  NevState 
end  Interface 
interface 

pure  real  function  randO 
end  function  rand 
end  interface 
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interface 

pure  integer  fiinction  irandO 
end  function  irand 
end  ‘interface 
interface 

pure  integer  function  GetD i spar ity( current ,  proposed, newE,oldE,  t) 
integer,  intent (in)  ::  current,  proposed,  newE,  oldE 
real,  intent (in)  ::  t 
end  function  GetDisp2u:ity 
end  interface 

allocate  (leftImage(MAX.ROWS.  MAX.COLS)) 
allocate  (right Image (MAX.ROWS,  MAX.COLS)) 
allocate  (newDMap (MAX.ROWS ,  MAX.COLS)) 
allocate  (randoms (MAX.ROWS .  MAX.COLS)) 
allocate  (oldEnergies (MAX.ROWS,  MAX.COLS)) 
allocate  (newEnergies (MAX.ROWS,  MAX.COLS)) 

n  =  600 

dMaxReal  =  D.MAX 

call  ReadLeft Image (n,  leftimage) 
call  ReadRightImage(n,  right  Image) 

start  =  secnds(O.O) 

!  Initialize  the  disparity  map 

call  random.number (randoms) 

dMapd,:)  =  0 
dMap(n, :)  =  0 
dMap(:,l)  *  0 
dMap(;,(n-D.MAX)+l:n)  =  0 
forall  (i=2:n-l,  j=2:n-D.MAX) 

dMap(i,j)  =  mod(randoms(i, j)  *  10000,  dMaxReal) 
end  forall 

currentTemp  =  STARTING.TEMP 

!  Start  the  annealing  process 

do  while  (currentTemp  >=  ENDING.TEMP) 
print  ♦ ,  *  Temp  =  * ,  currentTemp 
do  scanCounter  =  1, SCANS 

forall  (i=2:n-l,  j=2:n-D.MAX) 

newDMap (i,j)  -  NewState(dMap(i, j) ,  D.MIN,  D.MAX) 


oldEnergies (i,j)  =  ((abs(dMap(i“l, j-1)  -  dMap(i,j))  +  t 
ft  abs(dMap(i’l,j)  -  dMap(i,j))  +  ft 
ft  abs(dMap(i-l,j+l)  -  dMap(i,j))  +  ft 
ft  abs(dMap(i, j*l)  -  dMap(i,j))  +  ft 
ft  abs(dMap(i, j+1)  -  dMap(i,j))  +  ft 
ft  abs(dMap(i+l,j-l)  -  dMap(i,j))  +  ft 
ft  abs(dMap(i*t‘l, j)  -  dMap(i,j))  +  ft 
ft  abs(dmap(i+l,j+l)  -  dMap(i,j)))  *  ft 
ft  LAMBDA)  +  abs(leftImage(i,j+dMap(i,j))  -  ft 
ft  right Image (i,j)) 

newEnergies (i,j)  =  ((abs(dMap(i-l, j-1)  -  newDMap(i, j))  +  ft 
ft  abs(dMap(i-l,j)  -  newDMap(i, j))  +  ft 
ft  abs(dMap(i-l,j+l)  -  newDMap(i, j))  +  ft 
ft  abs(dMap(i, j-1)  -  newDMap(i, j))  +  ft 
ft  abs(dMap(i, j+1)  -  newDMap(i, j))  +  ft 
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ft  abs(dMap(i+l, j“l)  -  newDMap(i, j))  +  ft 

ft  abs(dMap(i+l, j)  -  EewDMapCi, j))  +  ft 

ft  abs(dmap(i+l, j+1)  -  newDMap(i, j) ))  *  ft 

ft  “  LAMBDA)  +  abs( lef t Image (i,j+newDMap(i,j))  -  ft 

ft  right Image (i,j)) 

randoms (i,j)  =  rand() 

end  for all 

forall  (i=2:n**l,  j=2:n-D.MAX,  (newEnergies(i, j)  -  ft 

ft  oldEnergiesCi, j)  <  0)  .or.,  (randoms (i,  j)  <  exp(  ft 

ft  (newEnergies(i, j)  -  oldEnergies ( i , j ) )  /  -currentTemp) ) ) 
dMapCiJ)  =  newDMap(i,j) 
end  forall 

enddo 

currentTemp  =  currentTemp  -  (currentTemp  *  TEMP.REDUCTION) 
enddo 

stop  =  secnds(O.O) 

print  ♦,  'Total  time  parallel  =  stop  -  start 

call  WriteResults(n,  dMap) 

end 


subroutine  ReadLeft Image (n»  left Image) 
implicit  none 
integer  n 

integer  left Image (n,  n) 
integer  i,  j 

open  (unit=10,  f ile='LeftDot600.f * ,  status='old') 
read  (10, ♦)  ((leftlmage(i, j) ,  j=l ,n) ,i=l,n) 
close  (10) 

return 

end 


subroutine  ReadRight Image (n,  right Image) 
implicit  none 
integer  n 

integer  right  Image  (n,  n) 
integer  i,  j 

open  (unit=10,  f ile='RightDot600.f ' ,  status='old') 
read  (10,*)  ((rightlmage(i, j) ,  j=l,n) ,i=l,n) 
close  (10) 

return 

end 


subroutine  WriteResults(n,  dispaurityMap) 
implicit  none 
integer  n 

integer  disparityMap(n,  n) 
integer  i,  j 
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open  (unit=10,  file='HPF.ROW„RESULTS’ ,  status=*new*) 

write  (10, ♦)  (((disparityMap(i, j)  *  (255  /  6)),  j=l,n) ,i=l,n) 

close  (10) 

end  subroutine  WriteRe suits 


pure  integer  function  NewState(oldState,  D_MAX) 

implicit  none 

intent (in)  ::  oldState,  D_MIN,  D.MAX 

integer  oldState 

integer  D.MIN,  D_MAX 

integer  randomNumber 

real  x 

interface 

pure  integer  function  irandO 
end  function  irand 
end  interface 


if  (mod(irand() .  D.MAX)  >  ((D.MAX  /  2)  -  1))  then 
randomNumber  =  -1 
else 

randomNumber  ~  1 
end  if 

NevState  =  oldState  ^  randomNumber 

if  (NevState  <  D.MIN)  then 
NewState  =  D.MIN 
else  if  (NewState  >  D.MAX)  then 
NevState  =  D.MAX 
end  if 

return 

end  function 


pure  integer  function  GetDispzirity (current,  proposed,  newE,  oldE,  t) 
implicit  none 

intent (in)  ::  current,  proposed,  newE,  oldE 
intent (in)  : :  t 

integer  current,  proposed,  nevE,  oldE 

real  t 

interface 

pure  real  function  randO 
end  function  rand 
end  interface 

if  ((newE  -  oldE)  <  0)  then 
GetDisparity  =  proposed 

else  if  (randO  <  cxp((-(newE-oldE))  /  t))  then 
GetDisparity  =  proposed 
else 

GetDisparity  =  current 
end  if 

return 
end  function 
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